Fix: strip diacritics in step 3 CSV filter #2

Merged
jakebromberg merged 1 commit into main from fix/filter-diacritics
Feb 12, 2026

@jakebromberg
Member

Summary

  • The library stores ASCII artist names ("Bjork") but Discogs uses diacritics ("Björk")
  • Step 3's normalize_artist() only did .lower().strip(), so "björk" != "bjork", and all releases for those artists were silently excluded from the cache
  • Now uses unicodedata.normalize('NFKD', ...) to strip diacritics before comparing, matching the approach already used in step 8's verify_cache.py
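
For readers who have not seen the diff, the change can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function bodies are assumptions, and `normalize_artist_old` is a hypothetical name used only to show the previous behaviour. NFKD decomposes an accented character into a base letter plus combining marks, so the marks still have to be dropped explicitly; here that is done with `unicodedata.combining`.

```python
import unicodedata


def normalize_artist_old(name: str) -> str:
    """Previous behaviour as described above: case and whitespace only."""
    return name.lower().strip()


def normalize_artist(name: str) -> str:
    """Assumed shape of the fix: decompose, drop combining marks, then fold case."""
    # NFKD splits "ö" into "o" followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()


# Library names are ASCII; Discogs names may carry diacritics.
library_artists = {normalize_artist("Bjork")}

# Old comparison misses the artist, new comparison matches.
missed_before = normalize_artist_old("Björk") not in {normalize_artist_old("Bjork")}
matched_now = normalize_artist("Björk") in library_artists
```

Letters that have no Unicode decomposition, such as "ø" or "ß", pass through this approach unchanged, so it only covers names whose accents are expressed as combining marks.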

Test plan

  • Existing TestNormalizeArtist cases still pass
  • New parametrized cases for Bjork, Sigur Ros, Motorhead, Husker Du, Cafe Tacvba, Zoe
  • Full test_filter_csv.py suite passes (28 tests)
  • Next pipeline run should pick up artists with diacritics that were previously missed
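
The new cases could look something like the table below. This is a sketch only: the accented spellings are assumed Discogs forms of the names listed above, `normalize_artist` is a stand-in for the real function in step 3, and `failing_cases` is a hypothetical helper. In a pytest suite the same table could be fed to `pytest.mark.parametrize`; a plain loop is used here to keep the example free of dependencies.

```python
import unicodedata


def normalize_artist(name: str) -> str:
    """Stand-in for step 3's function (assumed implementation)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()


# (assumed Discogs spelling, library spelling from the test plan)
CASES = [
    ("Björk", "Bjork"),
    ("Sigur Rós", "Sigur Ros"),
    ("Motörhead", "Motorhead"),
    ("Hüsker Dü", "Husker Du"),
    ("Café Tacvba", "Cafe Tacvba"),
    ("Zoé", "Zoe"),
]


def failing_cases() -> list:
    """Return the pairs whose normalized forms differ; empty means all match."""
    return [
        (discogs, library)
        for discogs, library in CASES
        if normalize_artist(discogs) != normalize_artist(library)
    ]
```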

The library stores ASCII names ("Bjork") but Discogs uses diacritics
("Björk"). The step 3 filter compared with .lower().strip() only, so
all releases for diacritics artists were silently excluded from the cache.
@jakebromberg jakebromberg merged commit 27cdc42 into main Feb 12, 2026
3 checks passed